Revisiting the negative example sampling problem for predicting protein-protein interactions

Authors

  • Yungki Park
  • Edward M. Marcotte
Abstract

MOTIVATION: A number of computational methods have been proposed that predict protein-protein interactions (PPIs) based on protein sequence features. Since the number of potential non-interacting protein pairs (negative PPIs) is very high both in absolute terms and in comparison to that of interacting protein pairs (positive PPIs), computational prediction methods rely upon subsets of negative PPIs for training and validation. Hence, the need arises for subset sampling for negative PPIs.

RESULTS: We clarify that there are two fundamentally different types of subset sampling for negative PPIs. One is subset sampling for cross-validated testing, where one desires unbiased subsets so that predictive performance estimated with them can be safely assumed to generalize to the population level. The other is subset sampling for training, where one desires the subsets that best train predictive algorithms, even if these subsets are biased. We show that confusion between these two fundamentally different types of subset sampling led one study recently published in Bioinformatics to the erroneous conclusion that predictive algorithms based on protein sequence features are hardly better than random in predicting PPIs. Rather, both protein sequence features and the 'hubbiness' of interacting proteins contribute to effective prediction of PPIs. We provide guidance for appropriate use of random versus balanced sampling.

AVAILABILITY: The datasets used for this study are available at http://www.marcottelab.org/PPINegativeDataSampling.

CONTACT: [email protected]; [email protected].

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
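The contrast between random and balanced sampling of negative PPIs can be illustrated with a minimal Python sketch. The abstract does not define balanced sampling, so the sketch assumes it means drawing negative pairs so that each protein tends to appear in proportion to its number of positive interactions (its 'hubbiness'); the degree-weighted rejection sampler below is only an approximation of that idea, not the authors' procedure. The function names `random_negatives` and `degree_weighted_negatives`, and the toy protein labels, are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def random_negatives(proteins, positives, n, seed=0):
    """Draw n non-interacting pairs uniformly at random (unbiased subset)."""
    rng = random.Random(seed)
    known = {frozenset(p) for p in positives}
    # Every unordered pair that is not a known positive is a candidate negative.
    candidates = [pair for pair in combinations(sorted(proteins), 2)
                  if frozenset(pair) not in known]
    return rng.sample(candidates, n)


def degree_weighted_negatives(positives, n, seed=0, max_tries=100_000):
    """Assumed stand-in for balanced sampling: pick both partners with
    probability proportional to their positive-set degree, rejecting
    self-pairs and known positives. May return fewer than n pairs."""
    rng = random.Random(seed)
    known = {frozenset(p) for p in positives}
    degree = Counter(protein for pair in positives for protein in pair)
    names = sorted(degree)
    weights = [degree[name] for name in names]
    chosen = set()
    for _ in range(max_tries):
        if len(chosen) == n:
            break
        a, b = rng.choices(names, weights=weights, k=2)
        pair = frozenset((a, b))
        if a != b and pair not in known:
            chosen.add(pair)
    return [tuple(sorted(pair)) for pair in sorted(chosen, key=sorted)]


# Toy data: protein "A" is a hub with three positive interactions.
toy_proteins = ["A", "B", "C", "D", "E", "F"]
toy_positives = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]

unbiased = random_negatives(toy_proteins, toy_positives, 4)
hub_matched = degree_weighted_negatives(toy_positives, 2)
```

Following the distinction drawn in the abstract, a subset like `unbiased` is the kind one would want for cross-validated testing, while a deliberately biased subset like `hub_matched` is only a candidate for training.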

Related resources

Revisiting Beta 2 Glycoprotein I, the Major Autoantigen in the Antiphospholipid Syndrome

Beta 2 glycoprotein I (β2GPI) is a single chain 50 kDa highly glycosylated glycoprotein at an approximate concentration of 4 μM in cells. The abundance of this protein in plasma and its high state of preservation indicate the important role of this protein in mammals. In addition, β2GPI has a particular structure in the fifth domain, and is categorized as the major antigen recognized by autoa...

Discovering Domains Mediating Protein Interactions

Background: Protein-protein interactions do not provide any direct information regarding the domains within the proteins that mediate the interactions. The majority of proteins are multi-domain proteins, and the interaction between them is often defined by pairs of their domains. Most earlier studies focus only on interacting domain pairs. However, they do not consider the in...

The value of serum level of S100B protein in predicting brain edema in children with diabetes ketoacidosis

Background and Objective: The S100B protein has recently been considered as an important marker for predicting severe brain damage; however, there has been very little evidence of increasing this marker in cerebral edema due to metabolic disorders such as diabetes ketoacidosis (DKA). This study was designed and performed to evaluate the prognostic role of S100B protein in predicting brain edema...

Predicting Protein-Protein Interactions from Multimodal Biological Data Sources via Nonnegative Matrix Tri-Factorization

Protein interactions are central to all the biological processes and structural scaffolds in living organisms, because they orchestrate a number of cellular processes such as metabolic pathways and immunological recognition. Several high-throughput methods, for example, yeast two-hybrid system and mass spectrometry method, can help determine protein interactions, which, however, suffer from hig...

Rabies Infection: An Overview of Lyssavirus-Host Protein Interactions

Viruses are obligatory intracellular parasites that use cell proteins to take control of cell functions in order to complete their life cycle. Studying viral-host interactions would increase our knowledge of viral biology and mechanisms of pathogenesis. Studies on pathogenesis mechanisms of lyssaviruses, which are the causative agents of rabies, have revealed some important ho...

Journal title:
  • Bioinformatics

Volume 27, Issue 21

Pages  -

Publication date: 2011